Stephen Uhler's HTML parser in 10 lines, previously known as HTML parser in 8 lines of
Tcl and currently known as HTML parser in 4 lines of Tcl, is a small toy HTML parser
originally written by Stephen Uhler. It is not correct: angle brackets in attribute values and
unbalanced braces in the HTML content can confuse it, but it is an interesting code snippet
nonetheless.


Attributes 


location (defunct) 


type=text/vnd.viewcvs-markup 


Description 


Stephen Uhler's HTML parser in 8 lines is now actually in 4 lines. Here is the current 
version: 


############################################################
# Turn HTML into TCL commands
#   html    A string containing an html document
#   cmd     A command to run for each html tag found
#   start   The name of the dummy html start/stop tags

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}


But the default value for cmd, HMtest_parse, was not defined, so I wrote one and applied it
to a sample bit of HTML:




proc HMtest_parse {tag state props body} {
    if {$state eq {}} {
        set msg "Start $tag"
        if {$props ne {}} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
}

HMparse_html {

<html>
<p class="bubba">
This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.
</p>
<p class="louielouie">
This is my second paragraph, which is OK,
but not as nice as my first one.
</p>

</html>
}

Output: 


Start hmstart 


Start html 


Start p with args: class="bubba" 


This is my very first paragraph. How do you 
like it? I think it has a lot to recommend it. 


End p 
Start p with args: class="louielouie" 


This is my second paragraph, which is OK, 
but not as nice as my first one. 


End p 
End html 
End hmstart 
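
The limitation mentioned in the introduction, angle brackets inside attribute values, can be reproduced with a short experiment. This is only a sketch: the handler name record, the global calls, and the sample markup are made up for the illustration, and the parser is repeated so that the snippet runs on its own.

```tcl
# The four-line parser from above, repeated so the example is self-contained.
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}

# Hypothetical handler: remember every callback exactly as it arrives.
set calls {}
proc record {tag state props body} {
    lappend ::calls [list $tag $state $props $body]
}

# The attribute value contains a ">" character.
HMparse_html {<a title="1 > 0">link</a>} record

# calls holds: hmstart start, <a> start, <a> end, hmstart end.
set props [lindex $calls 1 2]
set body  [lindex $calls 1 3]
puts "props: <$props>"
puts "body:  <$body>"
```

Because the pattern treats the first ">" as the end of the tag, the attribute string is cut off in the middle of the quoted value and the remainder of the value ends up at the start of the body.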


In fact, the code is not HTML-specific, and can handle simple XML code (e.g., that 
doesn't use the self-closing <tag/> format). It's like a mini-SAX. (Actually, it isn't quite like 
SAX. It's only like it because you define handlers for each tag. But unlike SAX it operates 
on a string in memory and doesn't execute until everything has been converted.) I've 
created a small XML parser based on this code and put it in TAX: A Tiny API for XML. 
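
To illustrate the point about simple XML, here is a minimal sketch that feeds a small XML fragment to the four-line parser and collects the callbacks in a list instead of printing them directly. The handler name collect, the global events, and the sample XML are invented for this example; the parser is repeated so that the snippet runs on its own. It assumes Tcl 8.5 or later for lassign.

```tcl
# The four-line parser from above, repeated so the example is self-contained.
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}

# Hypothetical handler: record each callback as {tag state props trimmed-body}.
set events {}
proc collect {tag state props body} {
    lappend ::events [list $tag $state $props [string trim $body]]
}

# Sample input; note the self-closing <br/> element.
HMparse_html {<note id="1"><to>Tove</to><br/></note>} collect

foreach event $events {
    lassign $event tag state props body
    puts "tag=$tag state=$state props=$props body=$body"
}
```

The ordinary elements arrive as start and end callbacks, much like SAX events. The self-closing element is reported once, with the tag name br/ and no matching end callback, which is why that format is not handled.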
PYK: the following comment refers to a previous, longer version of the parser,
which can be found in the history for this page.


In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. 
The proc HMcl is introduced in order to avoid trouble with [ ]'s. But it can also be avoided 
by enclosing the value of exp in { }'s. Also, the variable w doesn't need to be defined (at 
least in recent Tcl versions): \s can be used instead. Here's the new HMparse_html proc: 


proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}


OK, one more thing... If the cmd is an ensemble, then the different tags can be sub-procs
within the ensemble. For example, just as string length is a command where string is
the ensemble and length is the sub-proc, it should be possible to set up cmd so that cmd
p invokes the proc for parsing p tags, cmd html invokes the command for
parsing html tags, etc.
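
The ensemble idea can also be sketched without any extension, assuming Tcl 8.5 or later, where namespace ensemble is built in. The namespace name taghandler, its log variable, and its three handlers are invented for this illustration; the parser is the four-line version, repeated so that the snippet runs on its own.

```tcl
# The four-line parser, repeated so the example is self-contained.
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}

# Hypothetical ensemble: one subcommand per tag name.
namespace eval taghandler {
    variable log {}

    # The dummy start/stop tag needs a handler too, even if it does nothing.
    proc hmstart {state props body} {}

    proc html {state props body} {
        variable log
        if {$state eq {}} {
            lappend log "document start"
        } else {
            lappend log "document end"
        }
    }

    proc p {state props body} {
        variable log
        # Only start callbacks carry a body worth reporting.
        if {$state eq {}} {
            lappend log "paragraph: [string trim $body]"
        }
    }

    namespace export hmstart html p
    namespace ensemble create
}

# "taghandler p ..." now dispatches to taghandler::p, and so on.
HMparse_html {<html><p>one</p><p>two</p></html>} taghandler
puts [join $taghandler::log \n]
```

In this sketch a tag that has no matching subcommand makes the ensemble raise an error, so every tag that can occur in the input needs a handler.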


It's pretty easy to create ensembles in snit, so here's a snit version: 
package require snit

############################################################
# Turn HTML into TCL commands
#   html    A string containing an html document
#   cmd     A command to run for each html tag found
#   start   The name of the dummy html start/stop tags

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}


snit::type parser {
    proc isend {state} {
        if {$state eq {}} {
            return false
        } else {
            return true
        }
    }
    method hmstart {args} {}
    method html {state args} {
        if {[isend $state]} {
            puts {That's all, folks!}
        } else {
            puts {Let's get going!}
        }
    }
    method p {state props body} {
        if {![isend $state]} {puts $body}
    }
}

parser HMtest_parse


HMparse_html {
<html>
<p class="bubba">
This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.
</p>
<p class="louielouie">
This is my second paragraph, which is OK,
but not as nice as my first one.
</p>
</html>
}
Output: 
Let's get going!
This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.
This is my second paragraph, which is OK,
but not as nice as my first one.
That's all, folks!


The problem with using snit (or incr tcl) is that you have to declare handlers for all tags, or
you will end up with a runtime error (for example "method body not found"). I myself use the
following mechanism with some success:


proc HMtest_parse {tag state props body} {
    # Call a handler only if one has been defined for this tag.
    if {[info procs handle_$tag] ne {}} {
        handle_$tag $state $props $body
    }
}

proc handle_a {state props body} { ... }
proc handle_img {state props body} { ... }


This way, you only have to declare handlers for the tags that you care about. 
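
A complete, runnable variant of this dispatch might look as follows. The bodies of handle_a and handle_img, the global seen, and the sample HTML are invented for illustration; the parser is the four-line version, repeated so that the snippet runs on its own.

```tcl
# The four-line parser, repeated so the example is self-contained.
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}

# Dispatch to handle_<tag> only when such a proc exists;
# tags without a handler (here p and hmstart) are silently ignored.
proc HMtest_parse {tag state props body} {
    if {[info procs handle_$tag] ne {}} {
        handle_$tag $state $props $body
    }
}

# Hypothetical handlers that just note what they were given.
set seen {}
proc handle_a {state props body} {
    # Report links once, on the start callback.
    if {$state eq {}} {
        lappend ::seen "link: $props"
    }
}
proc handle_img {state props body} {
    lappend ::seen "image: $props"
}

HMparse_html {<p>See <a href="x.html">this</a> and <img src="y.png"></p>}
puts [join $seen \n]
```

Since info procs only reports procedures, info commands could be used instead if a handler might be something other than a proc, for example an alias.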


HD: Actually, Snit allows you to define a method that receives all unknown methods: 


delegate method * using {%s UnknownMethod %m}


method UnknownMethod {methodName args} { ... } 